DOCUMENT RESUME 



ED 244 343 



EA D15 771 



AUTHOR          Keesling, J. Ward
TITLE           Differences between Fall-to-Spring and Annual Gains in Evaluation of
                Chapter 1 Programs.
INSTITUTION     Advanced Technology, Inc., McLean, Va.; Education Analysis Center for
                State and Local Grants (ED), Washington, DC.
SPONS AGENCY    Department of Education, Washington, DC. Office of Planning, Budget,
                and Evaluation.
PUB DATE        Jan 84
CONTRACT        300-82-0380
NOTE            35p.; Prepared for the Planning and Evaluation Service.
PUB TYPE        Reports - Research/Technical (143)
EDRS PRICE      MF01/PC02 Plus Postage.
DESCRIPTORS     *Educationally Disadvantaged; Elementary Secondary Education;
                *Pretests Posttests; *Standardized Tests; Testing Problems; Testing
                Programs; Test Norms; *Test Reliability; Test Results; Test Use;
                *Test Validity
IDENTIFIERS     Education Consolidation Improvement Act Chapter 1; Elementary
                Secondary Education Act Title I; *Title I Evaluation and Reporting
                System

ABSTRACT

The Title I Evaluation and Reporting System (TIERS) was developed in order to examine
the extent to which Title I (now Chapter 1) is remediating the disadvantages in basic
skills of educationally deprived children. TIERS Model A contrasts the achievement of
Chapter 1 students to publishers' norms for hypothetically comparable groups of
students. However, the gains reported by districts using fall-to-spring testing cycles
far exceed those of districts on the annual cycle (fall-to-fall or spring-to-spring).
Three sources of problems in using Model A may account for this difference: (1) the
norm tables of published tests may not be relevant to Chapter 1 students; (2) the
publishers' norms may be used inappropriately; and (3) local testing practices may bias
the outcomes. Findings supporting each hypothesis are discussed in depth, leading to
the conclusion that districts should adopt an annual testing paradigm, since
fall-to-spring NCE (Normal Curve Equivalent) gains are unlikely to be accurate
reflections of the true impact of Chapter 1. (TE)



Reproductions supplied by EDRS are the best that can be made from the original document.






DIFFERENCES BETWEEN 
FALL-TO-SPRING AND ANNUAL GAINS 

IN EVALUATION OF 
CHAPTER 1 PROGRAMS 



J. WARD KEESLING



Advanced Technology, Inc. 

7923 Jones Branch Drive 
McLean, Virginia 22102 

Prepared For: 

Planning and Evaluation Service 
U.S. Department of Education






JANUARY 1984 



EDUCATION ANALYSIS CENTER 
FOR STATE AND LOCAL GRANTS 






TABLE OF CONTENTS

OVERVIEW AND SUMMARY
BACKGROUND
RELEVANCE OF PUBLISHER NORMS TO CHAPTER 1 STUDENTS
INAPPROPRIATE USE OF PUBLISHER NORMS
LOCAL TESTING PRACTICES
CONCLUSIONS AND ADVICE FOR LOCAL DISTRICTS
REFERENCES



LIST OF EXHIBITS

Exhibit 1: Differences Between Fall-to-Spring (FS) and Annual (AN) 1979-80 Title I
           Evaluation Results for Reading
Exhibit 2: Differences Between Fall-to-Spring (FS) and Annual (AN) 1979-80 Title I
           Evaluation Results for Mathematics
Exhibit 3: The Influence of Pretest Date on Gains When Fall Norms Are Interpolated
           From Spring Norms
Exhibit 4: The Effects of Early and Late Testing, and the Use of Interpolated Norms
           for Reading Scores
Exhibit 5: Computation of Spurious Losses Due to Early Fall Testing
Exhibit 6: Hypothetical Annual Growth Curve
Exhibit 7: Testing Dates for 1982-83 Evaluations of Chapter 1 in Iowa



OVERVIEW AND SUMMARY 

The Title I Evaluation and Reporting System (TIERS) was developed in order to
examine the extent to which Title I (now Chapter 1) is remediating the disadvantages in
basic skills achievement of educationally deprived children. Data collected via TIERS
are intended to answer the question, "How much more did pupils learn by participating
in the Title I project than they would have learned without it?" (Tallmadge and Wood,
1976a, p.2).

Most LEAs use TIERS Model A, which contrasts the achievement of Chapter 1
students to publishers' norms for hypothetically comparable groups of students. One
clear piece of evidence that it may not always be appropriate to use publisher norms as
the comparison is the large discrepancy between gains reported by districts using
fall-to-spring testing cycles compared to those using annual testing cycles (fall-to-fall, or
spring-to-spring) in Model A. At the national level the aggregated differences between
the testing cycles are larger than the aggregated gains reported under the annual cycle.
This is evidence of a strong method effect. This effect seems to be largely due to the
fact that fall test scores are very low; the spring test scores do not seem to vary
according to the testing cycle employed. This suggests that the true effects of Chapter
1 are similar no matter what the testing cycle and that the problem is confined to the
fall testing.

Three sources of problems in using Model A are explored for the possibility that
they account for exaggeratedly low fall scores (or higher spring scores) that would
account for the difference observed between the two testing cycles. These problem
sources are:

• The norm tables of published tests may not be relevant to Chapter 1 students.

• The publishers' norms may be used inappropriately.

• Local testing practices may bias the outcomes.

There are two ways in which the norm tables of published tests may not be relevant
to Chapter 1 students. The samples of students used by publishers may not be
representative of these students, and the curricula implicit in the tests may not
correspond to the content of texts used to instruct Chapter 1 students. Baglin (1981)
reports that very small fractions of the test publishers' initial samples of districts agree
to participate in norming studies, and that acceptances are harder to obtain from large
urban districts. Jaeger (1979) reported that different tests would report different
Normal Curve Equivalent (NCE) gains for the same percentile change on a common
scale. These effects were more pronounced for percentiles in the ranges served by
Chapter 1.

Note: This report was prepared pursuant to Contract Number
300-82-0380, U.S. Department of Education. The technical monitor
for this report was Dr. Robert Stonehill, U.S. Department of
Education. The opinions and conclusions expressed in this report
are those of the authors and do not necessarily represent the
position or policies of the U.S. Department of Education.

Freeman, Kuhs, Porter, Floden, Schmidt and Schwille (1983) show that tests and
texts show considerable variation in overlap. Linn et al. (1982) conclude, "Careful test
selection and/or adjustments in the instructional materials to improve the match
provides a project with a net advantage in comparison to the norms against which the
gains are judged . . ." A test that covers the curriculum to be taught would presumably
show low scores in the fall relative to the norm (the students in the project will have less
exposure to the content than the norm group because that content is yet to be taught in
the project), and higher scores in the spring (because the project students will have more
targeted instruction than the norm group).



Although it is hard to quantify these effects, the investigation reported here
suggests that both the unrepresentativeness of the samples and the variations in
curricular overlap are likely to make published norm tables inappropriate to the
assessment of Chapter 1 students when this assessment is conducted in a fall-to-spring
testing cycle.

There are two primary ways in which test publisher norms can be used
inappropriately. Errors can be made in converting raw scores to NCE scores, and
interpolations to account for discrepancies between the actual testing date and the
norming date can introduce biases.

Errors in converting raw scores to NCE equivalents may occur with some
frequency, especially where the procedure is not automated. Linn et al. concluded that
conversion errors could result in spurious gains of about 1 NCE in magnitude. Because
fall-to-spring testing may involve more tables, conversion errors may combine to
produce even larger spurious gains than in annual testing cycles.

Interpolation between the test publisher norming date and the actual date of
testing is usually performed using an assumption that growth is linear between the fall
empirical norm and the spring empirical norm. Evidence from a variety of sources
reported in the paper shows that this assumption is not likely to be accurate. The result
[...] values. Testing early in the spring does not make up for the spurious deficit in the fall
because the growth curve is steeper in the fall than in the spring. It is likely that this
effect contributes about 2 NCEs to the difference between the two testing cycles.

Local testing practices can also have strong influences on the outcomes of Chapter
1 assessments. Several authorities mentioned "stakeholder effects" that would tend to
make fall scores lower than anticipated and spring scores higher. On an annual testing
cycle, these effects probably balance out, but most would tend to exaggerate gains in a
fall-to-spring testing cycle. Among these effects are:

• Not encouraging the best performance on fall pretests

• Increasing motivation to do well on the spring posttests

• Teaching test-taking skills

• Teaching specific test items

• Coaching during the posttest

• Selecting out low-scoring students at the posttest

• Holding lower performing students back a grade

No published studies of these phenomena could be located that would permit
estimation of the magnitude of these effects. One very carefully documented report
from a large school district did reveal that teaching test items can produce very large
gains (21 NCEs in this particular case). It did seem clear, however, that most of these
effects would be likely to contribute to the observed discrepancy between the gains
reported using annual and fall-to-spring test cycles.

The more modest gains reported by projects using annual testing cycles correspond
to the gains reported in other studies of the effects of compensatory education. Because
these gains represent increments over and above the expected growth in basic skills
achievement, even modest gains, if cumulated, can become important. For example, a
student at the 25th percentile would be moved to the 35th by three years of exposure to
a project that produced annual gains of 2 NCEs. It is important to continue to collect
information via TIERS to document that Chapter 1 is capable of having this sort of
impact.
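
The 25th-to-35th percentile example can be checked with a short calculation. The sketch below is illustrative only and is not part of the original report; it assumes the standard definition of the NCE scale (a normalized scale with mean 50 and standard deviation 21.06, which is not spelled out in this paper), and the function names are hypothetical.

```python
from statistics import NormalDist

# Assumption: NCEs are a normalized scale with mean 50 and SD 21.06,
# chosen so that NCEs of 1, 50, and 99 match percentiles 1, 50, and 99.
NCE_MEAN = 50.0
NCE_SD = 21.06
_STD_NORMAL = NormalDist()


def percentile_to_nce(percentile):
    """Convert a national percentile rank (0-100, exclusive) to an NCE."""
    z = _STD_NORMAL.inv_cdf(percentile / 100.0)
    return NCE_MEAN + NCE_SD * z


def nce_to_percentile(nce):
    """Convert an NCE back to a national percentile rank."""
    z = (nce - NCE_MEAN) / NCE_SD
    return 100.0 * _STD_NORMAL.cdf(z)


def percentile_after_gains(start_percentile, annual_gain_nce, years):
    """Percentile reached if the same NCE gain is cumulated each year."""
    start_nce = percentile_to_nce(start_percentile)
    return nce_to_percentile(start_nce + annual_gain_nce * years)


# The example in the text: 25th percentile, 2 NCEs per year, three years.
final_percentile = percentile_after_gains(25, 2.0, 3)
print(f"final percentile: {final_percentile:.1f}")
```

Under these assumptions the cumulated 6-NCE gain lands close to the 35th percentile, in line with the figure quoted above.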

The best advice to be given now is to repeat the conclusion of Linn et al. (1982)
that districts should save money and testing burden by adopting an annual testing
paradigm. Fall-to-spring NCE gains are unlikely to be accurate reflections of the true
impact of Chapter 1.




DIFFERENCES BETWEEN FALL-TO-SPRING
AND ANNUAL GAINS IN EVALUATION OF CHAPTER 1 PROGRAMS

The Title I Evaluation and Reporting System (TIERS) was developed in order to
examine the extent to which Title I (now Chapter 1) is "working" to remediate the
disadvantages in basic skills achievement of educationally deprived children. The system
utilizes evaluation models developed by RMC (Tallmadge and Wood, 1978) under a
contract from the United States Office of Education (USOE). This contract was a part
of USOE's efforts to implement those sections of the Education Amendments of 1974
that required the Commissioner of Education to provide assistance to state departments
of education to assist local educational agencies to develop and apply systematic
methods of evaluation.

Data collected via TIERS are intended to answer the question, "How much more did
pupils learn by participating in the Title I project than they would have learned without
it?" (Tallmadge and Wood, 1976a, p.2). This question can be given a more formalized
expression, utilizing a variation of a general model proposed by Rubin (1972), as follows:

    The effect on a particular student's achievement of participating in
    Chapter 1 supplementary programs versus participating only in the usual
    curriculum is the difference between: (1) the achievement score of the
    student at posttest if the student received Chapter 1 services (for a period of
    time), and (2) the achievement score of the student at posttest if the student
    received only the usual curriculum (during the same period of time).

Because an individual student must be assigned to either Chapter 1 or to the usual
curriculum for a given period of time, this ideal model cannot be implemented. An
alternative model, the randomized experiment, has been developed to provide a
framework in which "the expected value of the difference in mean (achievement) scores
... is equal to the average difference that would be observed if all (students could be
exposed to both Chapter 1 supplements and to the usual curriculum alone) during the
same time interval" (Linn and Slinde, 1977, parenthesized material added). TIERS
Model B utilizes random assignment, but is very infrequently employed in actual
evaluations of Chapter 1.
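
The logic of the randomized experiment can be illustrated with a small simulation. This is a sketch under invented assumptions, not an analysis from the report: each simulated student has two potential posttest scores, only one of which is observed, and the score distribution (mean 35, SD 10) and the true effect of 2 NCEs are hypothetical values chosen for illustration.

```python
import random
from statistics import mean


def simulate_randomized_evaluation(n_students=20000, true_effect=2.0, seed=0):
    """Estimate a program effect from a simulated randomized experiment.

    Every student has two potential posttest scores (usual curriculum only,
    or usual curriculum plus Chapter 1 services), but only the score for the
    randomly assigned condition is observed.
    """
    rng = random.Random(seed)
    treated_scores = []
    control_scores = []
    for _ in range(n_students):
        usual_only = rng.gauss(35.0, 10.0)        # hypothetical NCE without services
        with_chapter1 = usual_only + true_effect  # hypothetical NCE with services
        if rng.random() < 0.5:                    # random assignment
            treated_scores.append(with_chapter1)
        else:
            control_scores.append(usual_only)
    # The difference in observed group means estimates the average of the
    # individual differences, which can never be observed directly.
    return mean(treated_scores) - mean(control_scores)


estimate = simulate_randomized_evaluation()
print(f"estimated effect: {estimate:.2f} NCEs (true effect: 2.00)")
```

Because assignment is random, the two groups differ only by chance before treatment, so the difference in means recovers the average effect up to sampling error. Model A replaces the control group with the publisher's norm group, which is where the problems discussed in this paper can enter.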




Most local education agencies (LEAs) use TIERS Model A, which contrasts the
achievement of Chapter 1-served students to a hypothetically comparable
usual-curriculum-only group. Comparability rests on the assumption that a Chapter 1 student
would, if exposed to the usual curriculum only, remain at the same percentile rank
among all students throughout their educational experiences. The national norms
supplied by publishers of standardized tests are used to estimate the expectation under
the usual-curriculum-only condition.

One clear piece of evidence that it may not always be appropriate to use publisher
norms to estimate the usual-curriculum-only condition is the large discrepancy between
gains reported by districts utilizing fall-to-spring testing cycles compared to those using
annual testing cycles (either fall-to-fall or, more often, spring-to-spring) in Model A.
Exhibits 1 and 2 are taken from Anderson (undated) and show the magnitude of these
discrepancies.



These differences in gains systematically favor the fall-to-spring testing cycle and
seem to be largely due to the differences in pretest scores: The fall tests yield lower
scores than the spring tests. Since the posttests are relatively close in magnitude
(except for the upper grades, which will have large standard errors as a function of the
small numbers of projects operating at those levels), it seems that the spring results are
quite similar and do not depend upon the testing cycle. Thus, the major issue is why the
fall test scores are so low.

The differences in gains are far from trivial. The median difference is 3.8 Normal
Curve Equivalents (NCEs) for reading, which exceeds any of the aggregate estimated
reading gains from the annual cycle reports. The median difference of 5.7 NCEs for
mathematics also exceeds the largest aggregate mathematics gain estimated from
annual cycle reports. There is a powerful method effect at work in these data.
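
The medians quoted above can be reproduced from the Gain columns of Exhibits 1 and 2. The sketch below is an illustrative check, not part of the original report; the lists are transcribed from the exhibits for grades 2 through 12, and two fall-to-spring reading gains (grades 5 and 6) are entered as 6.1 and 6.0 on the assumption that the gain equals posttest minus pretest in those rows.

```python
from statistics import median

# Weighted NCE gains by grade (2 through 12), transcribed from Exhibits 1 and 2.
# Assumption: the grade 5 and grade 6 fall-to-spring reading gains are 6.1 and
# 6.0, the values implied by the pretest and posttest columns.
fs_reading = [9.4, 7.4, 6.9, 6.1, 6.0, 5.5, 5.0, 5.2, 4.2, 3.2, 4.4]
an_reading = [1.0, 2.4, 1.9, 2.3, 3.3, 1.9, 2.2, 1.8, -0.7, -2.2, 1.4]
fs_math = [10.5, 8.6, 9.0, 8.2, 7.7, 6.3, 6.2, 6.2, 5.3, 5.6, 6.5]
an_math = [1.1, 0.4, 1.8, 2.5, 3.9, 2.2, 2.8, 0.5, -1.4, 0.4, 1.0]


def gain_differences(fall_to_spring, annual):
    """Grade-by-grade difference between fall-to-spring and annual gains."""
    return [round(fs - an, 1) for fs, an in zip(fall_to_spring, annual)]


reading_diff = gain_differences(fs_reading, an_reading)
math_diff = gain_differences(fs_math, an_math)

print("reading: median difference", median(reading_diff),
      "vs largest annual gain", max(an_reading))
print("math:    median difference", median(math_diff),
      "vs largest annual gain", max(an_math))
```

In both subjects the median difference between the two testing cycles is larger than the largest gain reported under the annual cycle, which is the comparison the text draws.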

The remainder of this paper explores reasons why these two testing cycles produce
such discrepant results. These reasons will be developed in the context of the Model A
variation on the general model proposed above, and will reflect on the suitability of the
assumptions utilized by Model A.



EXHIBIT 1. Differences Between Fall-to-Spring (FS) and Annual (AN) 1979-80
Title I Evaluation Results for Reading

                     Weighted Normal Curve Equivalents                    Weighted
            Pretest             Posttest               Gain             Number Tested
Grade    FS    AN   Diff.    FS    AN   Diff.    FS    AN   Diff.       FS        AN

  2     30.8  37.6  -6.8    40.2  38.6   1.6     9.4   1.0   8.4     310,555    85,019
  3     28.7  34.3  -5.6    36.1  36.7  -0.6     7.4   2.4   5.0     293,909   108,708
  4     28.7  34.7  -6.0    35.6  36.6  -1.0     6.9   1.9   5.0     270,826   108,576
  5     29.4  33.9  -4.5    35.5  36.2  -0.7     6.1   2.3   3.8     246,159   112,387
  6     29.7  33.9  -4.2    35.7  37.2  -1.5     6.0   3.3   2.7     212,819   107,706
  7     28.8  33.9  -5.1    34.3  35.8  -1.5     5.5   1.9   3.6     152,417    66,923
  8     29.0  33.6  -4.6    34.0  35.8  -1.8     5.0   2.2   2.8     122,013    58,026
  9     28.3  32.0  -3.7    33.5  33.8  -0.3     5.2   1.8   3.4      66,475    30,082
 10     28.6  30.2  -1.6    32.8  29.5   3.3     4.2  -0.7   4.9      36,102    14,215
 11     27.3  27.5  -0.2    30.5  25.3   5.2     3.2  -2.2   5.4      17,734     8,579
 12     25.6  25.4   0.2    30.0  26.8   3.2     4.4   1.4   3.0       8,383     7,146


EXHIBIT 2. Differences Between Fall-to-Spring (FS) and Annual (AN) 1979-80
Title I Evaluation Results for Mathematics

                     Weighted Normal Curve Equivalents                    Weighted
            Pretest             Posttest               Gain             Number Tested
Grade    FS    AN   Diff.    FS    AN   Diff.    FS    AN   Diff.       FS        AN

  2     32.0  41.9  -9.9    42.5  43.0  -0.5    10.5   1.1   9.4     124,576    50,084
  3     31.5  39.7  -8.2    40.1  40.1   0.0     8.6   0.4   8.2     137,608    65,407
  4     30.8  37.5  -6.7    39.8  39.2   0.6     9.0   1.8   7.2     147,338    70,637
  5     30.5  36.6  -4.9    38.7  39.0  -0.3     8.2   2.5   5.7     136,872    71,038
  6     30.9  35.4  -4.5    38.6  39.3  -0.7     7.7   3.9   3.8     119,003    69,002
  7     30.6  34.5  -3.9    36.9  36.7   0.2     6.3   2.2   4.1      74,807    36,268
  8     30.1  34.3  -4.2    36.3  37.1  -0.8     6.2   2.8   3.4      60,747    29,530
  9     29.8  34.6  -4.8    35.9  35.1   0.8     6.2   0.5   5.7      28,579    15,971
 10     32.0  32.9  -0.9    37.3  31.6   5.7     5.3  -1.4   6.7      12,192     7,718
 11     32.5  34.9  -2.4    38.1  35.3   2.8     5.6   0.4   5.2       5,270     4,158
 12     30.7  33.8  -3.1    37.2  34.9   2.3     6.5   1.0   5.5       2,195     3,587


In the Model A variation on the general model for evaluating Chapter 1, the
growth of Chapter 1 students is contrasted to the growth of students in the publishers'
norming studies who achieve at the same level at the time of the pretest. Three flaws
can occur to make this an inappropriate comparison:

• the norm tables of published tests may not be relevant to Chapter 1 students,

• the publishers' norms may be used inappropriately, and

• local testing practices may bias the outcomes.

Each of these will be discussed in turn and related to the phenomenon of fall-to-spring
gains being larger than those from annual testing cycles.



RELEVANCE OF PUBLISHER NORMS TO CHAPTER 1 STUDENTS

Published norm tables for different tests may not have equal relevance to Chapter
1-served students. There are two reasons for this:

• the norming groups may not be representative of Chapter 1 students, and

• the tested curriculum may not be the curriculum that is taught.

The evidence to be presented indicates that publishers may not attain fully
representative norming samples, and that there are considerable discrepancies in the
implicit curricular content of standardized tests. Large differences in NCE gains can
result from such variations in samples of students and content. Some of these variations
may directly influence the difference between fall-to-spring or annual gain scores, while
others may contribute to interactive effects that are discussed in a subsequent section.

Test publishers select districts for participation in their norming studies using
probability sampling methods that would permit the construction of accurate national
norms if all the selected districts agreed to participate. Baglin (1981) reports, however,
that only 13 to 32 percent of the initially selected districts agreed to participate in
recent norming studies, and that some publishers were unable to fill some of the
sampling cells specified by their design. Baglin (personal communication, 1983) also
states that the publishers did have difficulty in persuading large urban districts to
participate in norming studies (particularly those with enrollments in excess of 100,000
students). It is not clear that weighting the results can make up for missing one or more
of these large districts. Under many reasonable sampling schemes the nation's largest
districts would come into the sample with certainty and no amount of weighting could
compensate for a refusal to participate.

Strand (Test Information Center, personal communication, 1983) indicates that her
attempts to determine from test publishers what proportion of students participating in
norming were served by Title I were unsuccessful. Thus it is hard to say whether the test
publishers have represented the Chapter 1 population adequately in the norms. This
could have serious consequences for Chapter 1 evaluations.

Suppose, for example, that norm-group students who achieve at the same level as
Chapter 1 students on the pretest are not as likely to be economically disadvantaged and
that they have higher rates of academic growth because of that difference. Over time,
Chapter 1 students might not maintain the same percentile ranking because of the
difference in growth rates. By itself, this effect might not have consequences for the
difference between annual gains and fall-to-spring gains, but it may interact with other
phenomena to produce some of those differences, as discussed in a later section of this
paper.

If all nationally normed tests were equally appropriate to the Title I/Chapter 1
population of students, then one would expect that similar percentile gains (e.g., from
the 10th to the 15th percentile) on all tests would register similar NCE gains. A major
study of nationally normed tests, the Anchor Study* (Loret, Seder, Bianchini and Vale,
1975), compared the scalings of eight standardized reading comprehension tests and
concluded that the scales seemed generally comparable. However, Jaeger (1979)
performed extensive secondary analyses of these data and concluded that identical
percentile gains on the common scale derived in the Anchor Study would result in quite
different NCE gains being reported for the eight different tests, especially for scores in
percentiles below the 20th. Linn, Dunbar, Harnisch and Hastings (1982) were uncertain
as to the meaning of the lack of national representativeness, although they cited work by
Roberts that indicated that quite different NCE gains could result from different
normative samples.

*The Anchor Study used editions of tests that are now out of date. These tests were
normed in an era when the acceptance rate was considerably higher than it is at present.
For example, CTB reported to the Technical Advisory Committee of the Systemwide
Testing Program of the Department of Defense Dependents Schools (October, 1982) that
85 to 90 percent of their first choice districts participated in the 1968 norming of the
CTBS Form Q, while only 15 percent of the first choices participated in the norming of
CTBS Form V a decade later.

While it is not certain that the norm groups used by various test publishers vary in
the extent to which they are representative of Chapter 1 students, the evidence is that
such variation may exist, and is certainly important. It does seem likely that certain
types of Chapter 1 students (those in large urban districts) may be underrepresented in
norming groups, and this degree of underrepresentation may be increasing over time.
The resulting bias in the estimated national growth rates for lower-achieving students
could contribute to spurious assessments of the impact of Chapter 1.

Walker and Schaffarzick (1974) demonstrated that "students using different
curricula in the same subject generally exhibited different patterns of test performance, and
that these patterns generally reflected differences in the content inclusion and emphasis
in the curricula." Wiley and Bock (1967) give a short example showing that very large
differences in outcomes result from conscious choices to include or not include certain
material in the curriculum. Tallmadge (1977) reviewed many other studies that showed
that the content coverage of nationally normed tests varied widely. Wiley (1979a)
asserted that variations in curricular content coverage could very easily mask other
variations (e.g., pupil-teacher ratios, presence or absence of Chapter 1 funding) in
instructional settings that might be the objects of evaluation. Leinhardt and Seewald
(1981) proposed measures of curricular/test overlap to use in conducting research and
evaluation.

More recently, Freeman, Kuhs, Porter, Floden, Schmidt and Schwille (1983) have
demonstrated that popular textbooks and popular tests do not cover the same curricular
content in fourth-grade mathematics. Freeman, Belli, Porter, Floden, Schmidt and
Schwille (1983) have pursued this further to demonstrate that the manner in which the
teacher utilizes the textbooks can also influence the degree to which the curriculum
overlaps the test. The implications of this literature are that the choice of text and
teaching method may have an important influence on the degree to which students are
exposed to the curriculum implicit in the standardized test used to evaluate the
outcomes of instruction.

It will be useful to illustrate how much of a difference this match can make in the
content coverage. Using tabulations in Freeman, Kuhs, et al. (1983), it can be
determined that if one were to use the Houghton-Mifflin textbook in fourth grade, the
Iowa Test of Basic Skills (ITBS, also published by Houghton-Mifflin) would cover *2.9
percent of the topics to which the text devotes 20 or more problems. The Metropolitan
Achievement Test (MAT) and the Stanford Achievement Test cover less than 31 percent
of these topics. The CTBS Form S, Level II would cover nearly 47 percent of these
topics, but this test only has spring norms, and according to the California State
Department of Education (Test Planning Guide, 1982) the range of reliable measurements
for this test extends from the 12th to the 92nd percentile, which does not cover the
achievement range of many Chapter 1 students.

The district that chooses a test on grounds of tradition, cost, or because of a 
mandate from some other agency (e.g., the state) will find that judicious choice of text 
can make a large difference in the coverage overlap. Freeman, Kuhs, et al. (1983) show 
that the text published by Holt provides at least 20 problems each on 50 percent of the 
topics on the MAT, but only 22.2 percent of the topics on the Stanford. Interestingly, 
this is the maximum coverage provided for either test. As one might suspect, a district 
using the ITBS would be well advised to use the Houghton-Mifflin text as it provides at 
least 20 problems each on topics addressed by 31.8 percent of the tested items— the most 
of any text. 

The User's Guide emphasizes the importance of selecting a test that matches the
curriculum being evaluated, and with the availability of the literature cited above in
addition to this encouragement, it is likely that test choices have tended to enhance the
overlap between the two. Linn et al. (1982) conclude, "Careful test selection and/or
adjustments in the instructional materials to improve the match provides a project with
a net advantage in comparison to the norms against which the gains are judged . . ."

It is hard to reach a firm conclusion, however, as to whether that advantage is
more important to fall-to-spring testing cycles or to annual testing cycles. Baglin (1981)
reports that test publishers found districts that were using their texts to be more willing
to participate in norming studies. This means that the norms are perhaps slightly biased
to reflect greater test-curriculum overlap than would be true in a strictly random
sample, and the norm groups are possibly biased to the extent that some textbooks may
be used by certain segments of the population more than others (which leads back to the
question of representative norm groups discussed earlier). If these relationships were
perfect, then we might be able to speak of "user norms" rather than national norms. One
would not expect the test-curriculum overlap to create any problems if the national
norms were truly "user norms" (except, perhaps, in aggregating results across projects).

Unfortunately, it is hard to find empirical evidence to demonstrate that higher
than average test-curriculum overlap (relative to the national norm) will enhance NCE
gains. One extreme example with very large gains is given later in the paper.
Presumably if a test is given in the spring at the end of an instructional year in which the
overlap has been higher than average, the posttest results should reflect a higher
percentile standing. However, it is not clear that exposure to another year of higher
than average overlap will produce the additional increment needed to make further NCE
gains in a spring-to-spring annual testing cycle.

The same could hold true of fall-to-spring testing depending upon the level of the
test. A fall test that covers the content of the previous year will reflect gains due to
greater curricular overlap, and a subsequent year of instruction in content that overlaps
the same test (used again in the spring) may not yield gains relative to the norm.
However, choosing a test (to be used in fall and spring) that covers the to-be-taught
curriculum might result in students scoring below the norm group in the fall and, with
more exposure to the relevant curriculum during the year, scoring higher than the norm
group in the spring.

Another possibility is that a test that is more sensitive to the curriculum might be
sensitive to summer forgetting among students exposed to that curriculum. While most
studies of summer gains or losses show that students tend to attain some growth in basic
skills during the summer, the rate is quite a bit slower than the rate during the school
year (Carter, 1980). Suppose that this slower rate of gain is a reflection of the loss of
skills taught in school but unreinforced during the summer, and gains on other skills that
are reinforced during the summer. A curriculum that is closely mapped to a particular
test might result in students appearing to lose ground relative to the norm during the
summer. The "saw-tooth" pattern of growth (Linn et al., 1982; Linn, 1981) may be
exaggerated when the test used to measure growth is highly related to the curriculum
used to instruct students. This could lead to higher than average fall-to-spring gains,
while annual testing might produce little gain.


Clearly, the degree of overlap between test and curriculum is an important
influence on achievement gains, and may account for a substantial part of the difference
between annual and fall-to-spring gain scores. In combination with the evidence that
norming groups may underrepresent Chapter 1 students, it appears that national norms
for standardized tests may have only limited relevance to the evaluation of Chapter 1
students. At higher grade levels where Chapter 1 students are typically behind by
several grade levels and may not be exposed to a curriculum at all like the one implicit
in the tests, the norms could be much less relevant than those for younger students. This
could explain the declining trend in NCE gains (especially true of fall-to-spring gains)
from the lower to higher grade levels.



INAPPROPRIATE USE OF PUBLISHER NORMS

Test publisher norms are usually presented as extensive tabulations of conversions
from raw scores to various other scales: percentiles, grade equivalents, stanines,
expanded scale scores, and NCEs, to name some common scales. These tabulations are
usually presented for specific periods of the year, so that testing accomplished within
specific periods can be referred to the norm tables. Two flaws in the use of these tables
can cause spurious NCE gains (or losses):

• conversion errors in which a table look up is performed incorrectly, and






certain that conversion errors favor the fall-to-spring testing cycle. TIERS assessments
of gains involve the contrast of two score averages no matter what the testing cycle.
However, the fall-to-spring cycle probably involves the use of two different sets of
tables, while an annual cycle could use only one, and this might increase the number of
conversion errors.


Another source of the difference between fall-to-spring and annual gains is the
fact that test publishers do not have empirical norms for all common testing dates in the
fall and the spring. Older tests often had empirical norms only in the spring. Fall norms
were created by interpolating between spring norms. The User's Guide is quite clear that
tests must be used at times close to the publisher's empirical norm dates. Perhaps
because of this strong insistence, most publishers now have both a fall and a spring
empirical norming. Strand (1983) names the six tests most commonly used for Chapter 1
evaluations, and has indicated (in a personal communication) that all of these have both
fall and spring norms. It should be noted that not all LEAs may be using these tests. As
recently as the 1979-1980 school year, the State of California reported (Test Planning
Guide, 1982) that 27 percent of schools in compensatory education programs were using
tests with interpolated fall norms.

It is worth showing an example to indicate how much the use of interpolated fall
norms can distort fall-to-spring gains. The data presented in Exhibit 3 come from a
secondary analysis of data collected by the State of California in its evaluation of the
Early Childhood Education Program (Burstein, Keesling, Conklin and Doscher, 1977). The
tests involved are forms of the CTBS (Q, R, and S) that do not have empirical fall norms.
The California State Department of Education interpolated (linearly) between spring
norms to derive a single fall norm (set at October 15th).






EXHIBIT 3. The Influence of Pretest Date on Gains When Fall Norms Are Interpolated
From Spring Norms.

                  GRADE 1            GRADE 2            GRADE 3

Month          Reading   Math     Reading   Math     Reading   Math

September       15.6     19.4      13.1     16.6      13.3     15.6

October         10.5     15.6      10.3     13.9       9.7     12.0

Difference       5.1      3.8       2.8      2.7       3.6      2.6

SOURCE: Burstein, Keesling, Conklin and Doscher, 1977. Table 3 (page 163)
recomputed to show gains in NCE units. Nearly 100 schools tested in
September and about 200 schools tested in October. At least 85 percent of
these schools tested in April of the next year.

Testing at any time during September (which could be up to six weeks prior to the
interpolated norming date) will result in a spuriously low pretest NCE score and a
correspondingly inflated NCE gain score because September levels of achievement will
generally be lower than October levels; indeed the linear interpolation model hypothesizes
just this effect. As Exhibit 3 shows, the advantage of early testing amounted to at
least 2.6 NCEs. Burstein et al. showed that interpolating exactly to the date of testing
reduced these spurious gains, and accounting for slower growth rates over the summer
(non-linear interpolation from spring-to-spring) reduced the differential gains even
further.
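The mechanism can be sketched in a few lines of Python. This is only an illustration of linear interpolation between two spring normings: the function name, the dates, and the norm means of 30 and 60 raw-score points are hypothetical and are not taken from any publisher's tables or from the California data.

```python
from datetime import date

def interpolated_norm_mean(test_date, d0, m0, d1, m1):
    """Linearly interpolate the norm-group mean between two norming dates."""
    frac = (test_date - d0).days / (d1 - d0).days
    return m0 + frac * (m1 - m0)

# Hypothetical spring-to-spring norm means (illustrative values only).
spring_a, mean_a = date(1975, 4, 15), 30.0
spring_b, mean_b = date(1976, 4, 15), 60.0

# Single fall norm fixed at October 15, as in the California procedure.
fixed_fall_norm = interpolated_norm_mean(date(1975, 10, 15),
                                         spring_a, mean_a, spring_b, mean_b)

# Norm interpolated to the actual date for a class tested in mid-September.
september_norm = interpolated_norm_mean(date(1975, 9, 15),
                                        spring_a, mean_a, spring_b, mean_b)

# Raw-score points by which a September class is understated when it is
# compared to the October 15 norm instead of a date-specific norm.
understatement = fixed_fall_norm - september_norm
```

Under the linear model the understatement grows in direct proportion to the number of days between the test date and the fixed norm date, which is why interpolating to the exact date of testing removes it.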

As indicated earlier, the tests used most widely in Chapter 1 evaluations have
empirical fall and spring norms. These norms mean that interpolations or projections can
be made much closer to an actual data point, which should reduce the size of artificial
gains. However, such artifacts are not entirely eliminated, as demonstrated below.

The User's Guide recommends that testing not occur more than two weeks before
or after the publisher's norm date, but is not willing to declare test scores entirely out of
bounds unless they are obtained six or more weeks away from the norm date. There is a
variety of ways of dealing with the test data that arise from dates discrepant from the
publisher's norm.




One state evaluation office (personal communication, 1983) indicated that they
were using a canned computer package to process TIERS information that "threw out"
any LEA report that involved testing more than a total of 30 days away from the
published norms (adding together early testing in the fall and late testing in the spring).
This system does, however, allow one to test 30 days early in the fall, and it compares all
acceptable fall tests to the same norm, so that one can gain an advantage (spuriously low
pretest score) from early fall testing.

A study by the California State Department of Education (Test Planning Guide,
1982) gives some indication of the problems that may be anticipated by testing too early
or too late, and by using projected norms. Exhibit 4 condenses the results, which show
that early fall testing and late spring testing can combine to yield a spurious gain of
about 4 NCEs.



EXHIBIT 4. The Effects of Early and Late Testing, and the Use of Interpolated
Norms for Reading Scores

Source of Effects                   Effects (in NCE) on

                                    Pretest    Posttest

Using interpolated fall norms         -2          --

Early testing                         -2           0

Late testing                          +3          +2

SOURCE: Test Planning Guide published by the California State Department of
Education. This table combines results for both fall-spring and annual testing
cycles; the source does not report them separately. The source does not
indicate the extent of early and late testing. The original tabulation was in
percentile effects which have been converted to NCEs using the State
average of 38 NCEs as the starting point.



Exhibit 5 presents more evidence of the effects of early fall testing. Even within
the grace period recommended by TIERS it is possible to obtain an artificial loss in the
fall of 3.7 NCE units. Scoring services provided by some publishers project norms for
early fall testing (based on fall to spring growth) to the exact date of fall testing. They
usually assume, however, that a linear growth rate is appropriate. Exhibit 4 shows that
the difference between early and late testing is larger in the fall than it is in the spring,
and it is not difficult to imagine that the actual growth curve of achievement might be
like that shown in Exhibit 6. The linear projection or interpolation of norms based on fall
and spring norming dates will lead to misrepresentations of the fall and spring NCEs and,
consequently, the gains. These effects will probably inflate fall-to-spring gains, while
they will not greatly influence annual gains if the annual testing occurs at the same time
each year.

EXHIBIT 5. Computation of Spurious Losses Due to Early Fall Testing

CTBS Form S, Level B was normed in the cycle Spring-Fall-Spring. The reported means
and standard deviations are:

                  Raw Score    Raw Score
Month               Mean          SD

April (0.7)         31.3         12.2

Nov. (1.2)          35.6         13.7

April (1.7)         59.4         18.4

Fall (Nov.) to spring (April) normal progress would be made at the rate of
(59.4 - 35.6)/150 days = 0.16 points per day. Using the standard deviation of 13.7 we can
compute standard deviation units lost for each day testing occurs prior to the norm. For
example: 30 days x 0.16 points per day = 4.8 points lost for testing one month early.
This is equivalent to 4.8/13.7 = 0.35 standard deviation units or 0.35 x 21.06 = 7.4 NCE
units.

Computing these results for some likely testing dates yields:

Testing at         Which is             Produces a loss of

Mid September      6 weeks early        11.1 NCEs

Early October      4 weeks early (1)    7.3 NCEs

Mid October        2 weeks early (2)    3.7 NCEs

(1) Allowable under one reporting system if posttest is on norm date.
(2) TIERS recommended maximum gap in testing date.

SOURCE: Conklin, Burstein and Keesling, 1979.
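The arithmetic in Exhibit 5 can be reproduced with a short Python sketch. The function name and its parameter names are illustrative, not part of TIERS or any publisher's scoring service; the means, standard deviation, 150-day interval, and the NCE scale factor of 21.06 are the ones quoted in the exhibit. The choice of 15, 30, and 45 days is an assumption about the intervals behind the tabulated losses.

```python
NCE_SD = 21.06  # NCE units per standard deviation, as used in Exhibit 5

def spurious_nce_loss(days_early, fall_mean=35.6, spring_mean=59.4,
                      days_between_norms=150, fall_sd=13.7):
    """NCE loss from testing `days_early` days before the fall norm date,
    assuming linear growth between the fall and spring normings."""
    daily_growth = (spring_mean - fall_mean) / days_between_norms
    points_lost = days_early * daily_growth
    return points_lost / fall_sd * NCE_SD

# Assumed intervals: 15, 30, and 45 days before the norm date.
losses = {days: round(spurious_nce_loss(days), 1) for days in (15, 30, 45)}
```

Carrying the unrounded daily rate through the calculation gives about 3.7 NCEs for 15 days and about 7.3 NCEs for 30 days; the worked example in the exhibit rounds the rate to 0.16 first and so arrives at 7.4 for 30 days.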




A related question of importance is the incidence of early and late testing. The
information in the Test Planning Guide (California State Department of Education, 1982)
shows that in 1979-1980 evaluations, 16 percent of 2,527 schools evaluating compensatory
education programs pretested early (by at least one day). Forty-eight percent (of
2,527 schools) tested late in the spring. Greater detail on testing dates was obtained
from the Iowa State Department of Education (via personal communications, 1983).
Data from this source are summarized in Exhibit 7.

EXHIBIT 7. Testing Dates for 1982-83 Evaluations of Chapter 1 in Iowa

                                  POSTTESTING DATES

                           Early                              Late
PRETEST        at least  10 to 14  5 to 9   1 to 4     At    1 to 4  5 or more
DATES          15 days     days     days     days     Norm    days     days

15 or more
days early        2         1        1        0        0       0        0

10 to 14
days early        1        19        4        1        0       1        0

5 to 9
days early        0         4        7        1        0       0        0

1 to 4
days early        2         1        7        9        1       6        1

At Norm 0 0 l * D 1 ft 3 1 


1 to ft 

clays late 0* 1 1^6 0 2 






5 or more 

. days late Jo \_ I I • . 4. 4 


Total 5 27 21 19 % 5 13 6 


The tabulation gives the percentage of 498 second grade reports (typically one per
school) at each combination of pretest and posttest times. Rounding errors make the
total add to 97 percent.




While the data in Exhibit 7 reveal that most schools are testing well within the
TIERS recommended time limits, there is a clear bias in favor of early testing. Twenty-three
percent test earlier in the fall than in the spring (the above-diagonal entries).
They should show positive biases because the time elapsed between the testings is
greater than the time between normings and because of non-linear growth as hypothesized
in Exhibit 6. Thirty-seven percent test early by the same amount in fall and spring
(the first four diagonal entries). If the model of Exhibit 5 is correct, then these cases
will show a bias to spuriously low fall scores and, consequently, spuriously high gains,
because the early pretesting is not fully compensated by early posttesting. Depending
upon the shape of the curve some of the cases where the posttest is earlier relative to the
norm date than the pretest might still show the same bias because the pretest effect is
much larger than the posttest effect. This means that there will be a bias to spuriously
high gains in even these fall-to-spring testing cycles.

One of the explanations offered for early testing in the spring is that the schools
want to be sure that they receive their results in time for the reports that are due to
their respective state departments of education. Early fall testing is motivated by a
desire to let teachers know more about the students they are teaching. There did not
seem to be any reasonable explanation for the late testing in the fall.

Linn (1981) makes a strong case that more should be known about growth curves
before it will be easy to compare the growth of one group of students against that of
another. Having two norming points for most tests is simply not enough. Most of the
studies that attempt to show that the norm group estimates of growth are reasonable
proxies for the usual-curriculum-only treatment condition are based on fall and spring
norming points within one year (Tallmadge, 1982; Powers, Slaughter and Helmick, 1983),
or spring testing points over several years (Tallmadge and Fagan, 1977). To determine
why fall-to-spring testing cycles appear to be biased we need more than two points on
the growth curves for the students to be compared. Annual testing seems to have much
better prospects for developing growth curves that will be useful in interpreting the
nature of the gains made by Chapter 1 students. For example, Bock (1975) presents
growth curves for vocabulary over four years that show different curves for high school
males and females. Knowledge of such effects would be needed to properly assess the
effects of special interventions such as Chapter 1.






We can now return to a point made much earlier in this paper. If the samples of
students in the test publishers' national norm groups who score at the same fall pretest
levels as Chapter 1 students have a different growth rate, then the curve for the norming
sample that would be appropriate for Exhibit 6 may differ from the curve that would be
appropriate for Chapter 1 served populations. This effect would be confounded with the
problems of curricular overlap with the tests mentioned earlier: The growth curve of
students exposed to curricula with greater overlap would be different from the growth
curve of students exposed to other curricula. Furthermore, the growth curves for
students at different initial percentile rank ranges might be different. This is important
because some Chapter 1 projects are much more selective than others. Some states and
districts only include students in the 25th percentile or below, while others include
students below the 50th percentile. A further complication is the report of Mayer and
Farnsworth (1983) that suggests that some students continue to grow at the previous rate
after instruction has ceased, while others do not.

Clearly a rather extensive study would be necessary to isolate all of these
potential effects and prepare adequate growth curves. Ultimately, such a study might
run into the difficulty that there would be so few students truly comparable to Chapter 1
students, but who are not receiving services, that it would not be possible to generate an
expected growth curve under the usual-curriculum-only treatment condition. This would
mean that the Chapter 1 effects would be included in the growth expectation for
students at lower performance levels, and Model A would not be expected to detect gains
relative to the norms.

If conversion errors contribute 1 NCE to the differential gains reported in Exhibits
1 and 2, and problems with linear interpolation contribute between 2 and 3 NCEs, the
median difference is largely accounted for. It should be expected, however, that the
effects of conversion errors or linear interpolation problems will interact with the
problems of representative samples and content overlap. Conversion errors may occur at
random, but their positive bias may mean that they occur more often when scores appear
"too low." One incident was related in which all negative gain scores were converted to
positive gains before the project report was submitted.



The shape of the growth curve (Exhibit 6) will surely depend upon the sample of
students in the norm groups and the nature of the content match between the test and
the curriculum. A single nationally representative growth curve may only be an
approximation to the actual situation in any local project.



LOCAL 



In discussions with several authorities on testing in the preparation of this paper
(TAC representatives, test publisher representatives, testing experts, state and local
evaluators), a frequently expressed opinion was that testing in the fall and spring was
performed under conditions different from those specified in the publisher's manuals.
Some of the interviewees called these differences "stakeholder effects." Stakeholder
effects are different from the effects discussed earlier because they alter the degree to
which Chapter 1 influences the testing results, while the others deal with the degree to
which valid estimates of usual-curriculum-only treatment effects can be obtained. The
effects discussed earlier will result in spurious gains whether or not there is a Chapter 1
project in operation; stakeholder effects generally augment any effect due to Chapter 1
with an effect of an additional treatment condition. When the effect of the additional
treatment is not accounted for, Chapter 1 is credited with spuriously high scores.

Any alteration of the conditions for testing specified in the publisher's manual
means that the publisher's norm tables are no longer valid. In general, the authorities
contacted felt that deviations from the publisher's standardized conditions would produce
lower fall scores and higher spring scores. The deviations from standard conditions that
were mentioned include

• Not encouraging the best performance on fall pretests (on annual cycles the
pretest is also the posttest, and best performance is always encouraged)

• Emphasizing the importance of the posttest, increasing motivation to do well

• Teaching of test-taking skills

• Teaching specific test items

• Coaching during the posttest

• Selecting out low-scoring students at posttest

• Retention of lower performing students




Unfortunately, none of the authorities interviewed could provide a reference to any
published (or fugitive) study of these phenomena. A check of all the listings for 1981 and
1982 in ERIC with the word "testing" in the title produced no likely entries. A check of
the Current Index to Journals in Education from 1979 through 1983, under the headings
"Testing Conditions" and "Testing Problems" also produced no relevant literature. The
following compilation of tangential evidence and anecdotes gives a sense of the potential
magnitude of the problem.



The basic premise of the stakeholder effect is that the fall testing will be done
under conditions that tend to depress scores (or at least under conditions that do not
raise scores beyond the effects of prior instruction), while the spring tests are conducted
under conditions that will tend to raise scores. Annual testing cycles would result in no
spurious gains if these effects occurred in each grade level. Fall-to-spring results would
be strongly affected. For example, students probably know that the test in the fall is not
important for their grade, or whether they will be promoted to the next grade level.
Teachers probably tell them to relax and take it easy, that their scores will not matter.
In the spring, however, the test is known to be important. It may determine promotion
to the next grade. Teachers probably tell students that it is important and that they
should try to do well. They probably encourage them to rest well the night before and
eat well on the morning of the test. Would they do that for a fall testing?



One TAC representative suggested that fall testing is intended to identify students
in need of services. Even though students may be taken to a separate area to be tested,
and given a certain amount of encouragement, the purpose is to be sure to identify as
many errors as possible in each test protocol so that profiles of need can be developed.
An LEA representative said, "The pretest was farcical. The objective was to qualify as
many students as possible." In this LEA (which has since gone to annual testing), the
time to complete various subsections in the fall was shortened from the publisher's
recommendation, and the examiners would not clarify directions when asked.

Several of the people interviewed suggested that in the spring considerable
attention is lavished on the preparation for the posttest. Instruction in Social Studies
and Science begins to stress basic skills (sometimes these other subjects are not taught
at all in the spring to make way for additional basic skills instruction). Teachers stress
the parts of the curriculum they expect to be represented on the test, and they teach
test-taking skills. It is arguable that explicit instruction in test-taking skills would
constitute a special treatment, unlikely to be reflected in the publisher's norms. It is also
unlikely that teachers would devote much time to this subject in preparation for a fall
test (although they should, if they want to obtain information about the subject content
the students do not know, unconfounded with test-taking skills). Linn et al. (1982), citing
work by Roberts, indicate that practice effects and instruction in test-taking skills can
have a sizeable influence on outcomes (several NCEs). Since these effects would tend to
balance out on an annual testing basis, but are likely to be quite different in the fall than
in the spring, they could be responsible for much of the difference in the NCE gains
reported in the two testing cycles.



Probably the most obvious form of stakeholder effect is the deliberate teaching of 
items that will appear on the test. Achievement tests are usually composed of samples 
of items representing various skill domains. It is assumed that exposure to instruction 
will cover most of the domains to be tested and that the sampling of items will provide 
an accurate assessment of how much progress has been made on the entire set of 
domains. Emphasizing the instruction of specific test items in one or more domains will 
raise test scores, but probably means that the range of those domains has not been 
adequately covered.

It is important to distinguish this effect from that of choosing a test that
emphasizes the type of problem found on the test. In the latter case one is emphasizing
the overlap of the domains taught and tested, and while this can invalidate publisher's
norms (for reasons discussed earlier), it is not the same as emphasizing instruction in the
exact items to be tested. Maximizing the overlap between domains taught and tested
should assure good coverage of the domains, while teaching to the specific test items
limits the scope of coverage.

A good example of this phenomenon has been provided by Stephen Isaac of the San
Diego Unified School District (personal communication). Under a court order to raise
the achievement of students in minority-isolated schools, the district prepared a mastery
learning project in basic skills. Evaluators in the district were conscious of potential
problems with the security of the tests they planned to use in the evaluation (CTBS Form
S) and eventually discovered that systematic instruction in 30 out of the 40 vocabulary
items on the test had been offered to third grade students during the year. Each of
these 30 items was included in a set of "Word Warm-Up Exercises" that were used to
start the reading lessons. The stems and responses had been reversed (probably to hide
this test-specific instruction). In addition to this use in instruction, 6 or 7 of these items
were also used (with stems and responses still reversed from the CTBS format) in a series
of "cumulative tests" given to all children. These tests were returned to the children so
that they could learn the correct associations. It is estimated that the Word Warm-Ups
provided direct teaching of the 30 items 3 times each during the year, including one
exercise that was presented in a format identical to the CTBS test, except for the
stem/response reversal.

Results reported for the CTBS Form S vocabulary testing showed a gain from the
33rd percentile (NCE=41) the previous spring (there was no prior item-specific instruction)
to the 72nd percentile (NCE=62). The students were able to answer 12 more items
correctly than before. Because this test-specific instruction was detected, Form T of
the CTBS was given soon after Form S and students scored at the 43rd percentile
(NCE=46) in vocabulary. Clearly, teaching to the test invalidates the publisher's norm
tables as a means of determining the expected growth curve.

Teaching specific test items year after year would not yield large NCE gains on an
annual basis, but would produce large NCE gains in a fall-to-spring testing cycle. Telling
students the correct answers during the test session will have similar effects.

Another example of a stakeholder effect is in the selection of students to take the
tests. Some LEAs may eliminate the scores of some low-achieving students from the
posttest, and therefore, from the TIERS reports. California, for example, permits
teachers to exempt limited-English proficient (LEP) students from testing in the English
language. This may be a perfectly valid reason to protect some students from
discouraging experiences, but it can be abused by withholding students who should be
tested (i.e. are not truly LEP), but might make the project look ineffective. This form of
student selection bias may not contribute to differentiating between the two testing
cycles.



Another student selection device that could differentially affect outcomes under
the two testing cycles would be to implement a strong policy of retention. Such a policy
would be unlikely to be reflected in the publisher norms and would result in a number of
effects that would boost test scores:

• The retained students would score low in the fall (perhaps even lower than the
status that led to retention would lead one to predict, because they might be
depressed at being retained).

• The retained students would score higher in the spring because they would have
had another year's practice on the material (and they might be motivated to do
well).

• The students sent on would be those who grew faster and might be likely to do
so again. This would boost fall and spring scores.

Retention might benefit LEAs on fall-to-spring testing cycles.

Longitudinal studies tracking students for more than one year show that fall-to-spring
gains are not maintained (see Linn et al., 1982; Linn, 1981; and Perry, 1983 for
examples). TIERS reports of fall-to-spring gains appear to be too large. Linn et al.
(1982) report that major studies of the impact of Title I have shown gains of about 1 or 2
NCEs per year in reading. The data from annual evaluation cycles reported in TIERS
tend to support this degree of gain. Math gains appear to be a little larger than this.



Much of the evidence we have presented in this paper tends to indicate that the
fall data point is more questionable than the spring data point. Interpolations or
projections around the fall data point are more sensitive than are interpolations around
the spring data point (because the growth curve is steeper in the fall). There is probably
more variation in testing practices associated with fall tests than with spring tests.
While the spring test is probably more generally played up as important regardless of
whether one is in a fall-to-spring cycle or an annual cycle, the fall test may be treated
quite variably.

In the metric of raw scores (number of items answered correctly), and possibly in
expanded scale score metrics, the difference between fall and spring scores would
reflect the actual amount of learning. Unfortunately these metrics are not comparable
from test to test. NCE scores are intended to show the incremental gain over and above
expectation that can be attributed to compensatory education. But using NCEs to
measure gains from fall-to-spring requires several assumptions about the nature of
growth curves and the willingness of LEAs to use standardized testing practices that may
not be realistic.

Because NCEs measure incremental effects, even small values are potentially
important, especially if they cumulate. Carter (1980) shows evidence (p. 152) that about
60 percent of Title I students in a given year are in the program the next year also.
Gains of 2 NCEs per year, cumulated for three years, would move a student from the
10th percentile at the end of Grade 1 to the 16th by the end of Grade 4. The same gain
would move a student from the 15th percentile to the 23rd, and a student at the 25th
percentile would be moved to the 35th. These are respectable gains.
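These percentile shifts can be checked with a brief Python sketch. It assumes the usual definition of the NCE scale as a normal-curve transformation with mean 50 and standard deviation 21.06 (the scale factor used in Exhibit 5); the function names are illustrative only, and `statistics.NormalDist` requires Python 3.8 or later.

```python
from statistics import NormalDist

NCE_MEAN, NCE_SD = 50.0, 21.06
_std_normal = NormalDist()  # standard normal distribution

def percentile_to_nce(pct):
    """Convert a national percentile rank to an NCE."""
    return NCE_MEAN + NCE_SD * _std_normal.inv_cdf(pct / 100.0)

def nce_to_percentile(nce):
    """Convert an NCE back to a national percentile rank."""
    return 100.0 * _std_normal.cdf((nce - NCE_MEAN) / NCE_SD)

def percentile_after_gain(start_pct, nce_gain):
    """Percentile standing after adding an NCE gain to a starting percentile."""
    return nce_to_percentile(percentile_to_nce(start_pct) + nce_gain)

total_gain = 2 * 3  # 2 NCEs per year, cumulated over three years
results = {p: round(percentile_after_gain(p, total_gain)) for p in (10, 15, 25)}
```

Because percentiles are compressed near the middle of the distribution, the same 6-NCE gain moves a student further in percentile terms when the starting point is closer to the 50th percentile.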

Even though small LEAs (most LEAs in the country are of this size) will have too
few served students to reliably detect such small gains, their data is needed in the TIERS
system to document that Chapter 1 is producing effects of this magnitude on a
nationwide basis. The advice to such small LEAs would be not to regard any one year's
results as meaningful for local policy setting. An accumulation of data over time might
prove more useful, although the large standard errors of the outcome measures may
make it difficult to detect effects of changes in the nature of the program (such as the
materials used, the types of teachers and aides employed) and these effects could be
overwhelmed by any changes in the test used to assess outcomes.

Larger LEAs who switch from fall-to-spring testing to annual testing will probably
wonder how to handle reporting lower gains. If they had been testing fall-to-spring for
some time, they probably have noticed that they reported very large gains each year, but
that the same students were in Title I (or Chapter 1) over the years and their cumulated
gains were nothing like the sum of the yearly gains. Data that show the drop from spring
to fall, but show that fall-to-fall there is some maintenance of gains (e.g., Perry, 1983),
could probably be recovered from such testing systems to show that the annual testing
cycle will give a better estimate of the gain that is likely to cumulate through time.

Apparently, some LEAs have asked for a way to estimate the fall-to-spring gain
that would correspond to the annual gain they are estimating so that they can report
better-sounding news to their school boards and parents. They should be advised to be
more straightforward with these constituencies and indicate why the new (annual) cycle
will provide better information. In the interviews conducted with local evaluators in
preparing this report, those in LEAs that had changed to the annual cycle indicated that
the savings of money and burden on students and teachers far outweighed problems with
reporting the data.

The best advice to be given now is to repeat the conclusion of Linn et al. (1982)
that districts should save money and testing burden by adopting an annual testing
paradigm. Fall-to-spring NCE gains are unlikely to be accurate reflections of the true
impact of Chapter 1.






REFERENCES 



Anderson, Judith. Differences between academic year and calendar year Title I 
evaluation results. Washington, D.C.: Undated draft. 

Baglin, Roger F. Does "nationally" normed really mean nationally?
Journal of Educational Measurement, 1981, 18, 97-108.



Bock, R. Darrell. Multivariate Statistical Methods in Behavioral Research.
New York: McGraw-Hill, 1975.


Burstein, Leigh, J. Ward Keesling, Jon Conklin and Mary Lynn Doscher.
Auditing large-scale evaluation: The quality of evaluative information for
assessment of program impact and for decision-making. Studies in Educational
Evaluation, 1977, 155-168.

California State Department of Education. Test Planning Guide: Suggestions
for Avoiding Testing Problems. Sacramento, CA: Author, 1982.

Carter, Launor F. The Sustaining Effects Study: An Interim Report. Santa
Monica, CA: System Development Corporation, 1980. TM-5693/200/00.

Conklin, Jonathan E., Leigh Burstein and J. Ward Keesling. The effects of
date of testing and method of interpolation on the use of standardized test scores
in the evaluation of large scale educational programs. Journal of Educational
Measurement, 1979, 16, 239-246.

Freeman, Donald J., Gabriella M. Belli, Andrew C. Porter, Robert E. Floden, William H.
Schmidt, and John R. Schwille. The influence of different styles of textbook use on
instructional validity of standardized tests. Journal of Educational Measurement,
1983, 20, 259-270.

Freeman, Donald J., Therese M. Kuhs, Andrew C. Porter, Robert E. Floden,
William H. Schmidt and John R. Schwille. Do textbooks and tests define a national
curriculum in elementary school mathematics? The Elementary School Journal,
1983, 20, 501-513.

Horst, Donald P. Checklists of potential errors in the ESEA Title I Evaluation and
Reporting System. In: Bessey, Barbara L. (Ed.), Further Documentation of
State ESEA Title I Reporting Models and their Technical
Assistance Requirements, Phase II, Volume II. RMC Report UR-331. Mountain
View, CA: RMC Research Corporation, 1978.

Jaeger, Richard M. The effect of test selection on Title I project impact.
Educational Evaluation and Policy Analysis, 1979, 1(2), 33-40.

Leinhardt, Gaea and Andrea Mar Seewald. Overlap: What's tested, what's
taught? Journal of Educational Measurement, 1981, 18.



Linn, Robert L. Validity of inferences based on the proposed Title I evaluation models.
Educational Evaluation and Policy Analysis, 1979, 1(2), 23-32.


Linn, Robert L. Discussion: Regression toward the mean and the interval
between test administrations. New Directions for Testing and Measurement,
Number 8: Measurement Aspects of Title I Evaluations. San Francisco:
Jossey-Bass Publishers, Inc., 1980.



Linn, Robert L. Measuring pretest-posttest performance changes. In: Berk, Ronald A.,
... Johns Hopkins ...

Linn, Robert L., Stephen B. Dunbar, Delwyn L. Harnisch and C. Nicholas
Hastings. The validity of the Title I Evaluation and Reporting System. Chapter
Two in: Assessment of the Title I Evaluation and Reporting System, Elizabeth R.
Reisner, Marvin C. Alkin, Robert F. Boruch, Robert L. Linn, Jason Millman.
Washington, D.C.: U.S. Department of Education, April 1982.

Linn, Robert L. and Jeffrey A. Slinde. The determination of the significance
of change between pre- and posttesting periods. Review of Educational Research,
1977, 47, 121-150.

Loret, P.G., A. Seder, J.C. Bianchini and C.A. Vale. Anchor Test Study: Equivalence
and norm tables for selected reading achievement tests (grades 5 and 6).
Washington, D.C.: U.S. Government Printing Office, 1974.

Mayer, Victor J. and Carolyn H. Farnsworth. The presence of a momentum effect in
intensive time-series data on learning. Paper presented at the annual meeting of
the American Educational Research Association, Montreal, 1983.

Perry, Marcia D. A meta-analysis of Title I/Chapter 1 sustained effects study. Paper
presented at the Annual Meeting of the American Educational Research
Association, Montreal, 1983.

Powers, Stephen, Helen Slaughter and Cheryl Helmick. A test of the equipercentile
hypothesis of the TIERS Norm-Referenced Model. Journal of Educational
Measurement, 1983, 20, 299-302.

Roberts, A. O. H. Regression toward the mean and the regression-effect bias. New
Directions for Testing and Measurement, Number 8: Measurement Aspects of
Title I Evaluations. San Francisco: Jossey-Bass Publishers, Inc., 1980.



Rubin, Donald B. Estimating causal effects of treatments in experimental and
observational studies (ETS RB 72-39). Princeton, N.J.: Educational Testing
Service, 1972.

Strand, T. Memorandum to Sustained Achievement Study Committee. Evanston,
Illinois: Educational Testing Service (Test Information Center), October 12, 1983.






Tallmadge, G. Kasten. Title I evaluations: comparable outcome measures for dissimilar
instructional treatments? Paper presented at the 27th Annual Conference of Directors
of State Testing Programs, Princeton, New Jersey, October 1977.

Tallmadge, G. Kasten. An empirical assessment of norm-referenced evaluation
methodology. Journal of Educational Measurement, 1982, 19, 97-112.

Tallmadge, G. Kasten and Barbara M. Fagan. Cognitive growth and growth expectations
in reading and mathematics. RMC Report UR-326. Mountain View, CA: RMC
Research Corp., November 1977.

Tallmadge, G. Kasten and Christine T. Wood. User's Guide: ESEA Title I Evaluation
and Reporting System. Mountain View, CA: RMC Research Corp., October 1976a.

Tallmadge, G. Kasten and Christine T. Wood. Characteristics of Eight Commonly Used
Nationally Normed Tests. Technical Paper No. 5, ESEA Title I Evaluation and
Reporting System. Washington, D.C.: U.S. Office of Education, October 1976a.

Tallmadge, G. Kasten and Christine T. Wood. User's Guide: ESEA Title I
Evaluation and Reporting System. Mountain View, CA: RMC Research
Corp., Oct. 1978.

Walker, Decker F. and Jon Schaffarzick. Comparing curricula. Review of
Educational Research, 1974, 83-112.

Wiley, David E. Policy-responsive evaluation. In: ...
Eva Baker (Eds.), Proceedings of the 1978 CSE Measurement and Methodology
Conference. Los Angeles: University of California, 1979a.

Wiley, David E. Evaluation by aggregation: Social and methodological biases.
Educational Evaluation and Policy Analysis, 1979b, 1(2), 41-45.

Wiley, David E. and R. Darrell Bock. Quasi-experimentation in educational
settings: Comment. The School Review, 1967, 75, 353-366.

Wood, Christine T. Test norming practices and the norm-referenced evaluation
model. In: Bessey, Barbara L. (Ed.), Further Documentation of State ESEA Title I
Reporting Models and their Technical Assistance Requirements, Phase II, Volume
II. RMC Report UR-331. Mountain View, CA: RMC Research Corp., 1978.

Wood, Christine T. and G. Kasten Tallmadge. Local Norms. Technical Paper
No. 7, ESEA Title I Evaluation and Reporting System. Washington, D.C.: U.S.
Office of Education, October 1976.




